
information discrepancy for distillation. γ controls the proportion of discrepant proposal pairs, as further validated in Section 6.5.4.

For each iteration, we first solve the inner-level optimization, i.e., the proposal selection, by exhaustive sorting [249], and then solve the upper-level optimization, i.e., distilling the selected pairs, using the entropy distillation loss discussed in Section 6.5.3. Since only a modest number of proposals is involved, the inner-level optimization is relatively efficient.
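To make the alternating procedure concrete, the following is a minimal PyTorch-style sketch of one iteration. The helper names `score_fn` (the information-discrepancy measure between a student/teacher proposal pair) and `loss_fn` (the entropy distillation loss of Section 6.5.3) are placeholders of ours, not the reference implementation; `gamma` plays the role of the proportion γ described above.

```python
import torch

def select_discrepant_pairs(student_props, teacher_props, gamma, score_fn):
    # Inner level: exhaustively score every student/teacher proposal pair and
    # keep the top-gamma fraction with the largest information discrepancy.
    scores = torch.stack([score_fn(s, t)
                          for s, t in zip(student_props, teacher_props)])
    k = max(1, int(gamma * len(student_props)))
    return torch.argsort(scores, descending=True)[:k].tolist()

def distillation_iteration(student_props, teacher_props, gamma,
                           score_fn, loss_fn, optimizer):
    # One outer iteration: solve the inner level (pair selection) without
    # gradients, then distill only the selected pairs (upper level).
    with torch.no_grad():
        selected = select_discrepant_pairs(student_props, teacher_props,
                                           gamma, score_fn)
    loss = torch.stack([loss_fn(student_props[i], teacher_props[i])
                        for i in selected]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```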

6.5.3 Entropy Distillation Loss

After selecting a specific number of proposals, we crop the features according to the selected proposals. Most SOTA detection models are built on Feature Pyramid Networks (FPN) [143], which significantly improve the robustness of multiscale detection. For the Faster-RCNN framework used in this chapter, we resize the proposals and crop the features from each stage of the neck feature maps, as illustrated in the sketch below. For the SSD framework, we generate the proposals from its regression layer and crop the features from the feature map with the largest spatial size.
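As a rough illustration of the cropping step for the FPN (Faster-RCNN) case, the sketch below uses torchvision's `roi_align` to pool a fixed-size patch for each proposal from every neck level. The box format, output size, and the assumption that proposals are given in image coordinates are ours, not fixed by the method.

```python
import torch
from torchvision.ops import roi_align

def crop_proposal_features(neck_feats, proposals, image_size, out_size=7):
    # neck_feats: list of FPN maps, each [1, C, H_l, W_l];
    # proposals: [N, 4] boxes (x1, y1, x2, y2) in image coordinates.
    crops = []
    for feat in neck_feats:
        scale = feat.shape[-1] / image_size      # image coords -> this level's grid
        crops.append(roi_align(feat, [proposals], output_size=out_size,
                               spatial_scale=scale, aligned=True))
    return crops                                  # each entry: [N, C, out_size, out_size]
```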

Then we formulate the entropy distillation process as follows:

\max_{R_n^s} \; p(R_n^s \mid R_n^t). \qquad (6.87)

This is the upper level of the bi-level optimization, in which m has already been solved by the inner level and is therefore omitted. We rewrite Eq. 6.87 to obtain our entropy distillation loss as

L_P(w, \alpha; \gamma) = (R_n^s - R_n^t) + \mathrm{Cov}(R_n^s, R_n^t)^{-1}\,(R_n^s - R_n^t)^2 + \log\big(\mathrm{Cov}(R_n^s, R_n^t)\big), \qquad (6.88)

where \mathrm{Cov}(R_n^s, R_n^t) = E(R_n^s R_n^t) - E(R_n^s)\,E(R_n^t) denotes the covariance matrix.
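A minimal sketch of Eq. 6.88 for a single selected pair is given below, treating R_n^s and R_n^t as flattened proposal features and using a scalar covariance estimate for simplicity; the reduction over feature elements and the eps stabilizer are our assumptions rather than part of the formulation.

```python
import torch

def entropy_distillation_loss(r_s, r_t, eps=1e-6):
    # r_s, r_t: student/teacher features of one selected proposal (same shape).
    r_s, r_t = r_s.flatten(), r_t.flatten()
    # Cov(R^s, R^t) = E[R^s R^t] - E[R^s] E[R^t], estimated over feature elements.
    cov = (r_s * r_t).mean() - r_s.mean() * r_t.mean()
    cov = cov.clamp_min(eps)                      # keep the inverse and log well defined
    diff = r_s - r_t
    # (R^s - R^t) + Cov^{-1} (R^s - R^t)^2, averaged, plus log Cov  (Eq. 6.88).
    return (diff + diff.pow(2) / cov).mean() + torch.log(cov)
```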

Hence, we train the 1-bit student model end-to-end; the total loss for distilling the student model is defined as

L = L_{GT}(w, \alpha) + \lambda L_P(w, \alpha; \gamma) + \mu L_R(w, \alpha), \qquad (6.89)

where L_{GT} is the detection loss derived from the ground-truth labels, and L_R is defined in Eq. 6.80.
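Putting the three terms together, a brief sketch of the overall objective in Eq. 6.89 is shown below, using the λ and μ values selected in Section 6.5.4; the loss-term names are placeholders.

```python
def overall_loss(loss_gt, loss_p, loss_r, lam=0.4, mu=1e-4):
    # L = L_GT + lambda * L_P + mu * L_R  (Eq. 6.89)
    return loss_gt + lam * loss_p + mu * loss_r
```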

6.5.4 Ablation Study

Selecting the hyper-parameters. As mentioned above, we select the hyperparameters λ, γ, and μ in this part. We first select μ, which controls the binarization process. As plotted in Fig. 6.17 (a), we fine-tune μ in four situations: raw BiRes18, and BiRes18 distilled by Hint [33], FGFI [235], and our IDa-Det, respectively. In general, performance first increases and then decreases as μ increases. On raw BiRes18 and IDa-Det BiRes18, the 1-bit student performs best when μ is set to 1e-4, whereas μ = 1e-3 is better for the Hint- and FGFI-distilled 1-bit students. Therefore, we set μ to 1e-4 for the extended ablation study. Figure 6.17 (b) shows that performance first increases and then decreases as λ grows from left to right. In general, IDa-Det performs better with λ set to 0.4 or 0.6. Varying γ, we find that {λ, γ} = {0.4, 0.6} boosts the performance of IDa-Det the most, achieving 76.9% mAP on VOC test2007. Based on the ablation study above, we set the hyperparameters λ, γ, and μ to 0.4, 0.6, and 1e-4, respectively, for the experiments in this chapter.

Effectiveness of components. We first compare our information discrepancy-aware (IDa) proposal selection method with other proposal selection methods: Hint [33] (which uses the neck features without a region mask) and FGFI [235]. Table 6.5 shows the effectiveness of IDa on the two-stage Faster-RCNN. In Faster-RCNN, the introduction of IDa improves mAP